
Clarification around network diagrams #275

Closed
ruebot opened this Issue Mar 25, 2019 · 20 comments

ruebot commented Mar 25, 2019

During the Team Kompromat presentation at the DC Datathon, @edsu noted that the network diagrams can be misleading. One could assume that the network diagram represents what is in the archive itself that was analyzed. We should clarify that this is not the case. So, what's the best place to do it? A note on the diagram, something in the documentation? Something else?

@ruebot ruebot added the ux label Mar 25, 2019


ianmilligan1 commented Mar 25, 2019

Hmm. Maybe a note in the documentation as well as a hover-over question mark icon to display some help text like we do with the derivatives?


greebie commented Mar 25, 2019

Could we be specific about what is the case (we capture every domain and create an edge for every link we find in the web pages)? That is a limitation of the network graphs, since I think people imagine the archives contain everything in the Wayback Machine. (That would be really nice, of course!)


ianmilligan1 commented Mar 25, 2019

Just that what is being visualized is the domains that are captured as well as the domains that they link to (which may or may not be in the actual web archived collection).


ianmilligan1 commented Mar 25, 2019

What about something like this?

[Screenshot: proposed explanatory text for the network diagram (Screen Shot 2019-03-25 at 11.04.28 AM)]


greebie commented Mar 25, 2019

That works for me!


ruebot commented Mar 25, 2019

@ianmilligan1 I like that!

@edsu does that work?


edsu commented Mar 25, 2019

Thanks for hearing this part of the presentation, and dropping it in here. You guys are awesome. I like the explanation.

I guess I was imagining (at least) two different types of users of this view.

  • Archivists might like to see what was linked to but not crawled, because it could help them build their collections.
  • Researchers who are trying to understand the content might not care too much about what was archived, and are more interested in seeing the relationships regardless of whether they were crawled. Although I guess seeing what was not crawled could help inform other visualizations, like text analysis, etc.

Maybe it would need to be two views? It would be nice if the underlying derivative Gephi file had a property indicating whether it was crawled or not. Then it could be easy for people to examine...


greebie commented Mar 25, 2019

Adding a "crawled" or "domain"=1 attribute to the gexf would not be too expensive or difficult. Might be worth considering something in the sigmaJS to indicate a crawl as well (change the text size and/or colour? or the node shape?).
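For reference, GEXF 1.2 expresses per-node attributes by declaring them once and then referencing them from each node. A hand-written sketch of what a crawled flag could look like (ids, labels, and values are illustrative, not AUT's actual output):

```xml
<graph defaultedgetype="directed">
  <!-- Declare the boolean attribute once for the node class. -->
  <attributes class="node">
    <attribute id="0" title="crawled" type="boolean"/>
  </attributes>
  <nodes>
    <node id="a" label="archived.example">
      <attvalues><attvalue for="0" value="true"/></attvalues>
    </node>
    <node id="b" label="linked-only.example">
      <attvalues><attvalue for="0" value="false"/></attvalues>
    </node>
  </nodes>
  <edges>
    <edge id="0" source="a" target="b"/>
  </edges>
</graph>
```

Gephi reads declared attributes like this into its data laboratory, so users could then filter or recolour on the flag without any SigmaJS changes.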


ianmilligan1 commented Mar 27, 2019

My inclination is to not overcomplicate the sigmaJS, as we're fairly limited in what we can add there.

In terms of a crawled attribute or something, that makes sense. Or maybe just noting which nodes are origins, in which case they're part of the crawl, as opposed to destinations?


edsu commented Mar 27, 2019

It makes sense not to overcomplicate the sigmaJS. Maybe I'm going out on a limb, but I think most archivists would want to see what is actually in the collection, rather than a mixture of what is there and what isn't. We have similar quandaries in DocNow, where we have vis elements that ought to behave slightly differently based on the audience (researcher vs. archivist).

I think the edge already has a source and a target in the gexf file. A target could be the source of another edge though. Perhaps this isn't simple, and would require post-processing the graph...

Just out of curiosity, does the SigmaJS data get created as an artifact of the processing pipeline? Or are the Gephi files in some way used to generate it?
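The post-processing idea above (deriving a crawled flag from edge sources) could be sketched with Python's stdlib XML parser. The sample graph, the function names, and the "edge source implies crawled" rule are all illustrative assumptions, not AUT's actual output or logic:

```python
# Sketch: flag each node in a GEXF file as "crawled" if it appears
# as the source of at least one edge. Illustrative only.
import xml.etree.ElementTree as ET

NS = "http://www.gexf.net/1.2draft"

def q(tag: str) -> str:
    # Qualify a tag name with the GEXF namespace for ElementTree lookups.
    return f"{{{NS}}}{tag}"

def crawled_flags(gexf_text: str) -> dict:
    # Map node id -> True if the node is an edge source (i.e. was crawled).
    graph = ET.fromstring(gexf_text).find(q("graph"))
    sources = {e.get("source") for e in graph.find(q("edges")).findall(q("edge"))}
    return {n.get("id"): n.get("id") in sources
            for n in graph.find(q("nodes")).findall(q("node"))}

sample = """<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="a" label="archived.example"/>
      <node id="b" label="linked-only.example"/>
    </nodes>
    <edges>
      <edge id="0" source="a" target="b"/>
    </edges>
  </graph>
</gexf>"""

print(crawled_flags(sample))  # {'a': True, 'b': False}
```

Writing the flag back into the file would additionally mean declaring a boolean node attribute and appending an attvalue per node, per the GEXF schema.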


ianmilligan1 commented Mar 27, 2019

I think that's probably true, but as you've noted, researchers will also want to see what isn't there as well. So I think providing options in the Gephi file might be a good compromise here?


ruebot commented Mar 27, 2019

@edsu #146 is a long running issue I've been working on porting from Python to Ruby. It's awful. I might just make exec calls out to the Python script and call it a day. Anyway, I think that visualization would be what you're looking for.


ruebot commented Mar 27, 2019

Just out of curiosity, does the SigmaJS data get created as an artifact of the processing pipeline? Or are the Gephi files in some way used to generate it?

The SigmaJS viz comes from the gexf file that is created by GraphPass during the derivative generation pipeline. If you want me to point you to some of the code, let me know.


ianmilligan1 commented Mar 27, 2019

Ahhh, good point @ruebot re #146, that would give a good sense of what's in the collection (as opposed to relying on the network diagram to see).


edsu commented Mar 27, 2019

I think y'all should feel like you can close this ticket. Especially since it seems like #146 covers a known issue.


greebie commented Mar 27, 2019

#146 does seem like the better option. After looking at AUT a bit, it may be slightly more difficult to include the crawled=True attributes than I thought. The main issue is that the many approaches to networks we take in aut mean we'd have to make changes in multiple places (hashed vs. proper ids; GraphX vs. flatmap over tuples; gexf vs. GraphML, etc.) or risk inconsistent outputs.


ianmilligan1 commented Mar 27, 2019

Heh, thanks @edsu, but I wouldn't sell this issue short. I think at a minimum we should add some helper text explaining the visualization, and I do like the idea of letting people filter in the Gephi file.


ruebot commented Mar 27, 2019

Thanks @edsu!

@ianmilligan1 want to put in a PR with your work when you get a chance, then I'll start working on #146 again.


ianmilligan1 commented Mar 27, 2019

Sounds good! (Timeline all depends on whether I make the standby list on a flight this afternoon or not... heh.) Thanks @edsu @ruebot @greebie for your thoughts on this important ticket.

@ruebot ruebot closed this in #277 Mar 27, 2019

ruebot added a commit that referenced this issue Mar 27, 2019

Explaining graph visualization, partially resolves #275 (#277)
* Explaining graph viz, partially resolves #275
* Fleshes out the Gephi files documentation as well

ruebot commented Mar 27, 2019

Deployed #277 to production.
