Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289
Conversation
greebie
added some commits
Nov 7, 2018
codecov-io
commented
Nov 7, 2018
Codecov Report
@@ Coverage Diff @@
## master #289 +/- ##
==========================================
+ Coverage 70.36% 73.07% +2.71%
==========================================
Files 41 42 +1
Lines 1046 1170 +124
Branches 192 205 +13
==========================================
+ Hits 736 855 +119
- Misses 244 246 +2
- Partials 66 69 +3
Continue to review full report at Codecov.
ianmilligan1
requested changes
Nov 7, 2018
The GEXF generated by this works fine, but the GraphML won't work in Gephi. I used this script:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph.asGraphml(links, "/Users/ianmilligan1/desktop/links-for-gephi.graphml")
I tested the same ARCs/WARCs in 0.17.0, and the GraphML works when I create it there with WriteGraphML, so it's something in the new function.
Here's the error in Gephi:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[387,35]
Message: Element type "edge" must be followed by either attribute specifications, ">" or "/>".
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:158)
Caused: java.lang.RuntimeException
at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:181)
at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:199)
at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:169)
at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:341)
Caused: java.lang.RuntimeException
at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:349)
[catch] at org.gephi.utils.longtask.api.LongTaskExecutor$RunningLongTask.run(LongTaskExecutor.java:274)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
greebie
added some commits
Nov 7, 2018
ianmilligan1
approved these changes
Nov 8, 2018
Tested and the GraphML and GEXF files both work now.
One additional advantage: in Gephi, you sometimes have to provide the ID identifier (to create an ego network, for example). Right now that involves copying and pasting a long hash, which isn't trivial in their UI (it requires a few clicks). Now you could perhaps remember a three- or four-digit number and use it accordingly.
Anyway, I think this is a good approach to have them in parallel, at least for a little bit longer. And we can talk about the future of the three functions when we have a chance for a standup.
Just a few additional notes here. ExtractGraphX still uses the hash rather than .zipWithUniqueId(). It's possible that this graph generation approach is slightly faster due to optimization (when not calculating PageRank etc.). I'm going to explore more now that I'm coming closer to the end of term with teaching.
Would that be in a separate PR, or would it potentially affect this one?
I think it should be a separate PR, possibly referencing a different issue. I've been thinking about ExtractGraphX as a way to reduce the problems with GraphPass, but it's possible we could see some small efficiency gains just for regular aut production. I'd like to test that out. I just wanted to include the note in this PR since the id approach is still the old one.
ianmilligan1
requested review from
ruebot and
lintool
Nov 8, 2018
ruebot
referenced this pull request
Nov 12, 2018
Open
Update aut documentation for https://github.com/archivesunleashed/aut/pull/289 #71
lintool
approved these changes
Nov 15, 2018
minor suggestions, address and merge
I don't need to see it again.
greebie
added some commits
Nov 16, 2018
I got a failure running this on rho:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/geocities/warcs/indexed/GEOCITIES-20091029185858-00184-crawling08.us.archive.org.warc.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph(links, "/home/i2millig/results/new-gephi.gexf")
Did syntax change in the revisions, or is there an issue? The error log is here, but the tl;dr is:
The same file works with the older version.
It looks like the problem is that, on converting the edge tuples to ids, I did not escape the XML first. That's why the output is wonky.
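The diagnosis above can be illustrated with a minimal, self-contained sketch (the `escapeXml` helper below is hypothetical, not aut's actual function): escaping XML special characters before they land inside a GEXF/GraphML attribute value prevents exactly the malformed `edge` element Gephi choked on.

```scala
// Hypothetical sketch: escape XML special characters in a string
// before writing it into a GEXF/GraphML attribute value.
def escapeXml(s: String): String = s.flatMap {
  case '&'  => "&amp;"
  case '<'  => "&lt;"
  case '>'  => "&gt;"
  case '"'  => "&quot;"
  case '\'' => "&apos;"
  case c    => c.toString
}

// An unescaped '&' in a URL would break the <edge> element Gephi parses.
val raw = "http://example.com/?a=1&b=2"
println(escapeXml(raw)) // http://example.com/?a=1&amp;b=2
```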
Am sending the WARC to you via Slack, @greebie. As we've now had a number of fails on this, could you also do a close line-by-line read of your new code?
The bigger problem is that WriteGEXFTest and WriteGraphmlTest did not test for unescaped XML in the result. They tested the XML-escaping function, but not that it was applied to the result. All these errors should be caught at test time. WriteGraphTest now tests for escaped XML in the result, which should prevent this from happening again.
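The testing gap described here (the escape function was tested, but not its application to the serialized result) can be sketched as follows; `edgeElement` and `escapeXml` are hypothetical stand-ins, not aut's actual code:

```scala
// Hypothetical sketch: assert on the serialized output, not just the
// escaping helper, so an unescaped character in the result fails the test.
def escapeXml(s: String): String = s.flatMap {
  case '&'  => "&amp;"
  case '<'  => "&lt;"
  case '>'  => "&gt;"
  case '"'  => "&quot;"
  case c    => c.toString
}

def edgeElement(src: Long, dst: Long, label: String): String =
  s"""<edge source="$src" target="$dst" label="${escapeXml(label)}" />"""

val xml = edgeElement(1L, 2L, "news & \"links\"")
assert(xml.contains("&amp;"))             // ampersand escaped in the result
assert(!xml.contains("label=\"news & "))  // no raw ampersand survives
println(xml)
```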
Ran the same code on the new graph and it worked (after a Java heap space error, but that's not related to this). I also loaded the graph into Gephi and it worked with no errors.
OK, worked here too. Can you confirm that there's nothing missing in this PR?
I reviewed the code and it works as expected, based on my knowledge of the circumstances. I have one more push, however, to include some additional tests to cover potential future problems.
OK, ready for me to review again on your end, @greebie?
Yes. The tests should cover most of the situations now.
ianmilligan1
approved these changes
Nov 21, 2018
Looks good and works well. Tested on a large collection with:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph(links, "/tuna1/scratch/i2milligan/results/new-gephi.gexf")
and then ran again but swapped out last line with:
WriteGraph.asGraphml(links, "/tuna1/scratch/i2milligan/results/new-gephi.graphml")
Same results as using the original WriteGEXF(links, "/tuna1/scratch/i2milligan/results/new-gephi-old-command.gexf") command, but without the long node IDs; more legible now.
Over to @ruebot for final review.
@greebie, are you done adding code to this PR? We've tested this multiple times, done code review, and signed off, and things keep on getting added.
Done. The last commit was an attempt to improve coverage a bit.
greebie commented Nov 7, 2018 (edited)
GitHub issue(s):
#243
What does this Pull Request do?
This PR adds WriteGraph, which generates GEXF and/or GraphML files.
There are also some additional utilities, such as an id lookup.
This object generates properly unique ids via .zipWithUniqueId(). This differs from the current method in WriteGraphML and WriteGEXF, which simply creates ids from an MD5 hash of the url. The new method is better because the hash method could produce incorrect graphs where hashes collide (e.g. with very large graphs).
Timing tests with a medium-sized graph show that it increases processing time by 10-15%. However, it could reduce the size of network graph derivatives by an unknown margin (because numeric ids are smaller than hashes).
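As a rough local analogue of the new id scheme (assumptions: plain Scala collections stand in for Spark RDDs, zipWithIndex stands in for .zipWithUniqueId(), and the node names are made up):

```scala
// Sketch: each distinct node gets a compact numeric id instead of an
// MD5 hash of its URL, so ids are short and collision-free by construction.
val nodes = Seq("example.org", "example.net", "example.com")
val nodeIds: Map[String, Long] =
  nodes.zipWithIndex.map { case (url, i) => url -> i.toLong }.toMap

// Edges then reference the compact ids rather than long hash strings.
val edge = (nodeIds("example.org"), nodeIds("example.net"))
println(edge) // (0,1)
```

In aut itself the ids come from RDD.zipWithUniqueId(), which guarantees uniqueness across partitions but not consecutive numbering.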
Example:
Old way produces:
New way:
How should this be tested?
Should produce a GEXF file that opens in Gephi.
Change the last line to
WriteGraph.asGraphml(links, "new-gephi3.gexf")
to get GraphML. I have not tested it in GraphPass yet, but there is no reason why it should not work as directed.
Additional Notes:
WriteGraph would replace WriteGEXF & WriteGraphml, which can be deprecated.
CommandLineApp would also have to be changed before WriteGEXF etc. are removed.
WriteGraph … udf which was deprecated and removed.
Interested parties
@lintool @ruebot @ianmilligan1
Thanks in advance for your help with the Archives Unleashed Toolkit!