Add imagegraph, and webgraph to command line app. #432

Merged
merged 7 commits into from Apr 7, 2020

ruebot commented Apr 6, 2020

GitHub issue(s): #431

What does this Pull Request do?

Add imagegraph, and webgraph to command line app.
- Resolves #431
- Adds webpages, and imagegraph to command line app
- Adds tests for new functionality
- Clean-up doc comments
- Convert files with dos line endings to unix line endings

How should this be tested?

  • TravisCI

ImageGraphExtractor example:

  • bin/spark-submit --master local[8] --files /home/nruest/Projects/au/sample-data/log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/home/nruest/Projects/au/sample-data/log4j.properties' --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor ImageGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/ImageGraphExtractor --df

WebPagesExtractor example:

  • bin/spark-submit --master local[8] --files /home/nruest/Projects/au/sample-data/log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/home/nruest/Projects/au/sample-data/log4j.properties' --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor WebPagesExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/WebPagesExtractor --df
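
The two commands above differ only in the extractor name and the output directory, so for local testing they could be generated from a small wrapper. The following is a minimal sketch, not part of this PR: the function name `aut_extract_cmd` and the variables `SPARK_HOME`, `AUT_JAR`, `LOG4J`, and `DATA_DIR` are hypothetical, and the default paths are placeholders to be overridden. It only prints each command so it can be inspected before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper; all names and default paths here are assumptions,
# not something provided by aut. Override the variables via the environment.
SPARK_HOME="${SPARK_HOME:-/path/to/spark}"
AUT_JAR="${AUT_JAR:-/path/to/aut-fatjar.jar}"
LOG4J="${LOG4J:-/path/to/log4j.properties}"
DATA_DIR="${DATA_DIR:-/path/to/sample-data}"

# Print (do not run) the spark-submit command for one extractor.
# The flags mirror the examples in the PR description.
aut_extract_cmd() {
  local extractor="$1"
  printf '%s ' \
    "$SPARK_HOME/bin/spark-submit" \
    --master 'local[8]' \
    --files "$LOG4J" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:$LOG4J" \
    --class io.archivesunleashed.app.CommandLineAppRunner \
    "$AUT_JAR" \
    --extractor "$extractor" \
    --input "$DATA_DIR/geocities/*" \
    --output "$DATA_DIR/app-output/$extractor" \
    --df
  printf '\n'
}

# Emit one command per extractor added in this PR.
for extractor in ImageGraphExtractor WebPagesExtractor; do
  aut_extract_cmd "$extractor"
done
```

Printing rather than executing keeps the sketch runnable without a Spark install; the printed lines drop the shell quoting, so review them before pasting.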

Additional Notes:

  1. There is the side issue of the log4j config file required. I'll create a ticket for that, and work on a solution separately.
  2. As a separate issue, I noticed that DomainGraphExtractor writes GEXF. Does anybody remember why it uses GEXF? Shouldn’t it be graphml? https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/CommandLineApp.scala#L115-L123
  3. I haven't added webgraph and domains because they partly duplicate DomainFrequencyExtractor and DomainGraphExtractor, and there could also be some confusion about how the extractors are labeled.
  4. Once we sort out some of the above, I'll continue working on the documentation update branch I have locally for archivesunleashed/aut-docs#14
ruebot added 6 commits Feb 10, 2020
@ruebot ruebot requested review from lintool and ianmilligan1 Apr 6, 2020

codecov bot commented Apr 6, 2020

Codecov Report

Merging #432 into master will increase coverage by 0.28%.
The diff coverage is 96.29%.

@@            Coverage Diff             @@
##           master     #432      +/-   ##
==========================================
+ Coverage   77.70%   77.99%   +0.28%     
==========================================
  Files          41       43       +2     
  Lines        1534     1554      +20     
  Branches      282      286       +4     
==========================================
+ Hits         1192     1212      +20     
  Misses        217      217              
  Partials      125      125              

ruebot commented Apr 7, 2020

#433 has been created (log4j configuration issue).

@ruebot ruebot marked this pull request as ready for review Apr 7, 2020

ianmilligan1 left a comment

Tested locally on sample data and works like a charm (I borrowed the log4j.properties file from the other PR).

I suspect it generates GEXF because the work on this predates our shift from GEXF to graphml as the main graph output.

@ianmilligan1 ianmilligan1 merged commit 771ea82 into master Apr 7, 2020
3 checks passed
codecov/patch 96.29% of diff hit (target 77.70%)
codecov/project 77.99% (+0.28%) compared to 92b5f2d
continuous-integration/travis-ci/pr The Travis CI build passed
@ianmilligan1 ianmilligan1 deleted the issue-431 branch Apr 7, 2020