Add imagegraph, and webgraph to command line app. #432

Draft: ruebot wants to merge 6 commits into base: master

@ruebot
Member

ruebot commented Apr 6, 2020

GitHub issue(s): #431

What does this Pull Request do?

Add imagegraph and webgraph to command line app.
- Resolves #431
- Adds webpages and imagegraph to command line app
- Adds tests for new functionality
- Cleans up doc comments
- Converts files with DOS line endings to Unix line endings

How should this be tested?

  • TravisCI

ImageGraphExtractor example:

  • bin/spark-submit --master local\[8\] --files /home/nruest/Projects/au/sample-data/log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/home/nruest/Projects/au/sample-data/log4j.properties' --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor ImageGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/ImageGraphExtractor --df

WebPagesExtractor example:

  • bin/spark-submit --master local\[8\] --files /home/nruest/Projects/au/sample-data/log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/home/nruest/Projects/au/sample-data/log4j.properties' --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor WebPagesExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/WebPagesExtractor --df
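
The two commands above differ only in the value passed to `--extractor` and in the last component of the `--output` path. A minimal Python sketch of that pattern follows, for anyone who wants to script both runs. The function name, its parameters, and the placeholder paths are hypothetical; only the flags and the `CommandLineAppRunner` class name come from the commands above. It takes an explicit list of input files because, in the shell commands, the `*` glob is expanded by the shell before `spark-submit` sees it.

```python
import shlex


def build_submit_command(extractor, aut_jar, inputs, output_root,
                         log4j_properties, cores=8):
    """Assemble the spark-submit argument list used in the examples above.

    All parameter names are illustrative; the flags mirror the PR's commands.
    """
    return [
        "bin/spark-submit",
        "--master", f"local[{cores}]",
        "--files", log4j_properties,
        "--conf",
        "spark.driver.extraJavaOptions="
        f"-Dlog4j.configuration=file:{log4j_properties}",
        "--class", "io.archivesunleashed.app.CommandLineAppRunner",
        aut_jar,
        "--extractor", extractor,
        "--input", *inputs,
        # Each extractor writes to its own directory under output_root.
        "--output", f"{output_root}/{extractor}",
        "--df",
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute real locations before running anything.
    for name in ("ImageGraphExtractor", "WebPagesExtractor"):
        cmd = build_submit_command(
            extractor=name,
            aut_jar="path/to/aut-fatjar.jar",
            inputs=["path/to/sample.warc.gz"],
            output_root="path/to/app-output",
            log4j_properties="path/to/log4j.properties",
        )
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

The sketch only prints the command lines; passing the list to `subprocess.run(cmd, check=True)` from the Spark installation directory would execute them.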

Additional Notes:

  1. There is the side issue of the required log4j config file. I'll create a ticket for that and work on a solution separately.
  2. As a separate issue, I noticed that DomainGraphExtractor writes GEXF. Does anybody remember why it uses GEXF? Shouldn’t it be GraphML? https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/CommandLineApp.scala#L115-L123
  3. I haven't added webgraph and domains because there is a bit of duplication with DomainFrequencyExtractor and DomainGraphExtractor, and there could also be some confusion about how the extractors are labeled.
  4. Once we sort out a bit of the above, I'll continue working on the documentation update branch I have locally for archivesunleashed/aut-docs#14
ruebot added 6 commits Feb 10, 2020
@ruebot ruebot requested review from lintool and ianmilligan1 Apr 6, 2020

codecov bot commented Apr 6, 2020

Codecov Report

Merging #432 into master will decrease coverage by 0.22%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master     #432      +/-   ##
==========================================
- Coverage   77.70%   77.47%   -0.23%     
==========================================
  Files          41       43       +2     
  Lines        1534     1554      +20     
  Branches      282      286       +4     
==========================================
+ Hits         1192     1204      +12     
- Misses        217      225       +8     
  Partials      125      125              
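
The percentages in the report can be recomputed from the Hits, Misses, and Partials rows, since those three rows sum to the Lines row on both sides (1192 + 217 + 125 = 1534 and 1204 + 225 + 125 = 1554). A short Python sketch of that arithmetic follows; the function name is illustrative, and the remark about rounding versus truncation is an observation about these numbers, not a statement about how Codecov formats its output.

```python
import math


def coverage_percent(hits, misses, partials):
    """Coverage as hits over all tracked lines, expressed in percent."""
    total = hits + misses + partials
    return 100.0 * hits / total


# Counts copied from the coverage diff table above.
master = coverage_percent(hits=1192, misses=217, partials=125)
pr = coverage_percent(hits=1204, misses=225, partials=125)
delta = pr - master

print(f"master: {master:.4f}%")
print(f"#432:   {pr:.4f}%")
print(f"delta:  {delta:+.4f} percentage points")

# The same delta reads as 0.23 when rounded and 0.22 when truncated, which
# matches the two figures in the report (assumption: that is why they differ).
print(round(abs(delta), 2), math.floor(abs(delta) * 100) / 100)
```

The unrounded ratios are roughly 77.705% for master and 77.477% for this PR, a drop of about 0.228 percentage points.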