Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289

greebie · Nov 7, 2018

GitHub issue(s):

What does this Pull Request do?

This PR adds WriteGraph which generates GEXF and/or Graphml files.
There are also some additional utilities such as an id lookup.

This object generates proper unique ids. This differs from the current method in WriteGraphML and WriteGEXF that simply creates ids using an MD5 hash of the url. This method is better because the latter method could produce incorrect graphs in the case where hashes collide (eg. with very large graphs).

Timing tests with a medium-sized graph shows that it increases the processing time by 10-15%. However, it could reduce the size of network graph derivatives by an unknown margin (because numeric ids are smaller than hashes).

Example:
Old way produces:

<node id="405d19a958ba43d88e9edd7a77338aa3" label="laws.justice.gc.ca" />
<node id="69bde2cbb357119a50a950fca99a8341" label="english.uvic.ca" />
<node id="8687e4dc11548a5917504975feb7c649" label="oipc.bc.ca" />
<node id="30192bae759dafccc58bccc268c2b411" label="accuweather.com" />

New way:

<node id="0" label="laws.justice.gc.ca" />
<node id="38" label="english.uvic.ca" />
<node id="76" label="oipc.bc.ca" />
<node id="114" label="accuweather.com" />

How should this be tested?

Travis should pass.

def timed(f: => Unit) = {
  val start = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  println("Elapsed Time: " + (end - start))
}

timed {
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "new-gephi3.gexf") }

Should produce a Gexf that opens in Gephi.

Change last line to WriteGraph.asGraphml(links, "new-gephi3.gexf") to get graphml.

I have not tested it in GraphPass yet, but there is no reason why it should not work as directed.

Additional Notes:

WriteGraphmlwould replace WriteGEXF & WriteGraphml which can be deprecated.
CommandLineApp would also have to be changed before WriteGEXF etc. are removed.
I think aut had a previous WriteGraph udf which was deprecated and removed.
This applies only to the RDD graph functions. DF functions are not changed. (I need to review the way DFs do unique ids)
I added a node id lookup tool during the process of developing WriteGraph. It might be useful for the toolkit, but it can also be removed.

Interested parties

@lintool @ruebot @ianmilligan1

Thanks in advance for your help with the Archives Unleashed Toolkit!

codecov-io · Nov 7, 2018

Codecov Report

Merging #289 into master will increase coverage by 2.71%.
The diff coverage is 95.96%.

@@            Coverage Diff             @@
##           master     #289      +/-   ##
==========================================
+ Coverage   70.36%   73.07%   +2.71%     
==========================================
  Files          41       42       +1     
  Lines        1046     1170     +124     
  Branches      192      205      +13     
==========================================
+ Hits          736      855     +119     
- Misses        244      246       +2     
- Partials       66       69       +3

Impacted Files	Coverage Δ
...ain/scala/io/archivesunleashed/app/WriteGEXF.scala	`100% <ø> (ø)`	⬆️
...in/scala/io/archivesunleashed/app/WriteGraph.scala	`95.96% <95.96%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6080a7...dbee737. Read the comment docs.

ianmilligan1 · Nov 7, 2018

ianmilligan1 requested changes Nov 7, 2018

View changes

The GEXF generated by this works fine, but the GraphML won't work in Gephi. I used this script:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph.asGraphml(links, "/Users/ianmilligan1/desktop/links-for-gephi.graphml")

I tested the same ARCs/WARCs in 0.17.0 and the GraphML works when I created it there with WriteGraphML, so it's something in the new function.

Here's the error in Gephi:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[387,35]
Message: Element type "edge" must be followed by either attribute specifications, ">" or "/>".
	at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:158)
Caused: java.lang.RuntimeException
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:181)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:199)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:169)
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:341)
Caused: java.lang.RuntimeException
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:349)
[catch] at org.gephi.utils.longtask.api.LongTaskExecutor$RunningLongTask.run(LongTaskExecutor.java:274)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

ianmilligan1 · Nov 8, 2018

ianmilligan1 approved these changes Nov 8, 2018

View changes

Tested and the GraphML and GEXF files both work now.

One additional advantage: in Gephi, you sometimes have to provide the ID identifier (to create an ego network, for example). Right now, that involves copy and pasting a long hash which isn't trivial in their UI (requires a few clicks to copy-and-paste). Now, you could perhaps remember a three or four digit number and use accordingly.

Anyways, I think this is a good approach to have them in parallel at least for a little bit longer. And we can talk when we have a chance for a standup about future of the three functions.

greebie · Nov 8, 2018

Just a few additional notes here.

ExtractGraphX still uses hash and not .zipWithUniqueId(). It's possible that this graph generation approach is slightly faster due to optimization (when not calculating PageRank etc.). I'm going to explore more now that I'm coming closer to the end of term with teaching.

ianmilligan1 · Nov 8, 2018

I'm going to explore more now that I'm coming closer to the end of term with teaching.

Would that be in a separate PR or would it potentially affect this one?

greebie · Nov 8, 2018

I think it should be a separate PR, possibly referencing a different issue. I've been thinking about ExtractGraphX as a way to reduce the problems with GraphPass, but it's possible we could see some small efficiency gains just for regular aut production. I'd like to test that out.

I just wanted to include the note in this PR since the id approach is still the old one.

ianmilligan1 · Nov 8, 2018

OK sounds good @greebie. I'll let @lintool and @ruebot review this. From a user perspective, works for me.

lintool · Nov 15, 2018