Add a number of additional app extractors. #451

Merged
merged 2 commits into from Apr 21, 2020

Member

ruebot commented Apr 21, 2020

GitHub issue(s): #447

What does this Pull Request do?

Add a number of additional app extractors.

  • Resolves #447
  • Add AudioInformationExtractor, ImageInformationExtractor,
    PDFInformationExtractor, PresentationProgramInformationExtractor,
    SpreadsheetInformationExtractor, TextFilesInformationExtractor,
    VideoInformationExtractor, WebGraphExtractor,
    WordProcessorInformationExtractor
  • Add tests for the new extractors
  • Update CommandLineApp to use new extractors
  • Add domain and language columns to WebPagesExtractor
  • Change "TEXT" to "csv"
  • Lower case "GEXF" and "GRAPHML"

How should this be tested?

  • Travis CI
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor AudioInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/AudioInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/ImageInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PDFInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PDFInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PresentationProgramInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PresentationProgramInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor SpreadsheetInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/SpreadsheetInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor TextFilesInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/TextFilesInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor VideoInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/VideoInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WordProcessorInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WordProcessorInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphInformationExtractor
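The commands above differ only in the extractor name, so they can be generated in a loop. The following is a minimal dry-run sketch, not part of the PR: it only prints one command per extractor added here. AUT_JAR, INPUT, OUTPUT_DIR, and build_command are hypothetical names introduced for illustration, and the default paths are placeholders. The list uses WebGraphExtractor, the name given in the PR description.

```shell
#!/bin/sh
# Dry-run sketch: print one spark-submit command per extractor added in this PR.
# AUT_JAR, INPUT, and OUTPUT_DIR are hypothetical placeholders; set them to your own paths.
AUT_JAR="${AUT_JAR:-/path/to/aut-fatjar.jar}"
INPUT="${INPUT:-/path/to/sample-data/*}"
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/output}"

# Extractor names as listed in the PR description.
EXTRACTORS="AudioInformationExtractor ImageInformationExtractor PDFInformationExtractor PresentationProgramInformationExtractor SpreadsheetInformationExtractor TextFilesInformationExtractor VideoInformationExtractor WebGraphExtractor WordProcessorInformationExtractor"

# build_command is a hypothetical helper: it prints the command, it does not run it.
build_command() {
  printf 'bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner %s --extractor %s --input %s --output %s/%s\n' \
    "$AUT_JAR" "$1" "$INPUT" "$OUTPUT_DIR" "$1"
}

for extractor in $EXTRACTORS; do
  build_command "$extractor"
done
```

Each extractor writes to its own subdirectory of OUTPUT_DIR, mirroring the layout used in the commands above. Review the printed commands before running any of them from the Spark directory.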

Additional Notes:

  1. I added WebGraphExtractor as an additional option, since its output differs slightly from the csv output of DomainGraphExtractor.
  2. I tweaked WebPagesExtractor to produce output similar to, and richer than, that of PlainTextExtractor. We might want to consider removing PlainTextExtractor in the future.
  3. For all the binary extractors, I only added the binary information extractor. Before we add the binary extractor, or binary + binary information (the full DataFrame), we should talk it out a bit more, and do some testing with csv output.
@ruebot ruebot requested review from lintool and ianmilligan1 Apr 21, 2020
Member Author

ruebot commented Apr 21, 2020

I'll get an associated documentation PR opened up later today.

codecov bot commented Apr 21, 2020

Codecov Report

Merging #451 into master will increase coverage by 2.17%.
The diff coverage is 98.58%.

@@            Coverage Diff             @@
##           master     #451      +/-   ##
==========================================
+ Coverage   74.55%   76.72%   +2.17%     
==========================================
  Files          40       49       +9     
  Lines        1285     1422     +137     
  Branches      246      264      +18     
==========================================
+ Hits          958     1091     +133     
- Misses        211      215       +4     
  Partials      116      116              
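The percentages in the report can be reproduced from the hit and line counts. This is a small sketch that assumes coverage is hits divided by total lines, which is consistent with the figures above; coverage_percent is a hypothetical helper name, not a Codecov tool.

```shell
#!/bin/sh
# Sketch: recompute the reported coverage, assuming coverage = hits / lines.
# coverage_percent is a hypothetical helper; $1 = hits, $2 = total lines.
coverage_percent() {
  awk -v hits="$1" -v total="$2" 'BEGIN { printf "%.2f\n", hits / total * 100 }'
}

coverage_percent 958 1285    # master: prints 74.55
coverage_percent 1091 1422   # this PR: prints 76.72
```

The difference between the two results is the +2.17% shown in the report.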
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Apr 21, 2020
Member Author

ruebot commented Apr 21, 2020

Documentation PR: archivesunleashed/aut-docs#57

Member

ianmilligan1 left a comment

Worked nicely!

Note that the example command bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphInformationExtractor should have used WebGraphExtractor, but I don't think that affects anything. I'm mentioning it in case the PR text is used in the future for testing or copy-and-pasting.

@ianmilligan1 ianmilligan1 merged commit f1eb43b into master Apr 21, 2020
3 checks passed
codecov/patch 98.58% of diff hit (target 74.55%)
codecov/project 76.72% (+2.17%) compared to 17ac324
continuous-integration/travis-ci/pr The Travis CI build passed
@ianmilligan1 ianmilligan1 deleted the issue-447 branch Apr 21, 2020
Member Author

ruebot commented Apr 21, 2020

Oh, sorry. That was copypasta on my part.

Member

ianmilligan1 commented Apr 21, 2020

Heh no worries @ruebot - it was actually good to see robust error messages.

20/04/21 16:28:11 ERROR CommandLineApp: WebGraphInformationExtractor not supported. The following extractors are supported:
20/04/21 16:28:11 ERROR CommandLineApp: PDFInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: TextFilesInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: ImageGraphExtractor
20/04/21 16:28:11 ERROR CommandLineApp: WebPagesExtractor
20/04/21 16:28:11 ERROR CommandLineApp: ImageInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: WordProcessorInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: SpreadsheetInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: VideoInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: WebGraphExtractor
20/04/21 16:28:11 ERROR CommandLineApp: AudioInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: PresentationProgramInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: DomainGraphExtractor
20/04/21 16:28:11 ERROR CommandLineApp: DomainFrequencyExtractor
20/04/21 16:28:11 ERROR CommandLineApp: PlainTextExtractor
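A typo like this can also be caught before Spark starts. The sketch below checks a name against the supported extractors copied from the log output above. It is a local pre-flight check under that assumption, not how CommandLineApp validates its arguments, and the list may differ in other versions of aut. SUPPORTED_EXTRACTORS, is_supported_extractor, and check_extractor are hypothetical names.

```shell
#!/bin/sh
# Sketch: pre-flight check of an extractor name before calling spark-submit.
# The names are copied from the error log in this thread; other aut versions may differ.
SUPPORTED_EXTRACTORS="PDFInformationExtractor TextFilesInformationExtractor ImageGraphExtractor WebPagesExtractor ImageInformationExtractor WordProcessorInformationExtractor SpreadsheetInformationExtractor VideoInformationExtractor WebGraphExtractor AudioInformationExtractor PresentationProgramInformationExtractor DomainGraphExtractor DomainFrequencyExtractor PlainTextExtractor"

# Hypothetical helper: succeeds only when $1 exactly matches a listed name.
is_supported_extractor() {
  case " $SUPPORTED_EXTRACTORS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: report the result for one name.
check_extractor() {
  if is_supported_extractor "$1"; then
    echo "ok: $1"
  else
    echo "unsupported: $1"
  fi
}

check_extractor WebGraphInformationExtractor   # prints "unsupported: WebGraphInformationExtractor"
check_extractor WebGraphExtractor              # prints "ok: WebGraphExtractor"
```

Padding both the list and the pattern with spaces forces a whole-name match, so a partial name such as WebGraph is rejected.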
ianmilligan1 pushed a commit to archivesunleashed/aut-docs that referenced this pull request Apr 21, 2020