Add a number of additional app extractors. #451
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor, PDFInformationExtractor, PresentationProgramInformationExtractor, SpreadsheetInformationExtractor, TextFilesInformationExtractor, VideoInformationExtractor, WebGraphExtractor, WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lowercase "GEXF" and "GRAPHML"
I'll get an associated documentation PR opened up later today.
codecov bot commented Apr 21, 2020
Codecov Report
@@            Coverage Diff             @@
##           master     #451      +/-   ##
==========================================
+ Coverage   74.55%   76.72%   +2.17%
==========================================
  Files          40       49       +9
  Lines        1285     1422     +137
  Branches      246      264      +18
==========================================
+ Hits          958     1091     +133
- Misses        211      215       +4
  Partials      116      116
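For readers checking the numbers: the reported percentages are consistent with coverage computed as hits divided by total lines, where lines = hits + misses + partials. The snippet below is only a sketch of that arithmetic using the figures from the report; the function name `coverage_percent` is made up for illustration and is not part of Codecov.

```python
def coverage_percent(hits, lines):
    """Coverage as hits divided by total lines, in percent."""
    return 100.0 * hits / lines


# Figures copied from the report above: (hits, misses, partials, lines).
master = {"hits": 958, "misses": 211, "partials": 116, "lines": 1285}
pr_451 = {"hits": 1091, "misses": 215, "partials": 116, "lines": 1422}

# Sanity check: hits + misses + partials adds up to the line count.
for report in (master, pr_451):
    total = report["hits"] + report["misses"] + report["partials"]
    assert total == report["lines"]

before = coverage_percent(master["hits"], master["lines"])
after = coverage_percent(pr_451["hits"], pr_451["lines"])

print(f"master: {before:.2f}%")
print(f"#451:   {after:.2f}%")
print(f"change: {after - before:+.2f}%")
```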
Documentation PR: archivesunleashed/aut-docs#57
Worked nicely! Note that the example command
Oh, sorry. That was copypasta on my part.
Heh no worries @ruebot - it was actually good to see robust error messages.
ruebot commented Apr 21, 2020
GitHub issue(s): #447
What does this Pull Request do?
Add a number of additional app extractors.
AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
How should this be tested?
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor AudioInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/AudioInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/ImageInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PDFInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PDFInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PresentationProgramInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PresentationProgramInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor SpreadsheetInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/SpreadsheetInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor TextFilesInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/TextFilesInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor VideoInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/VideoInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WordProcessorInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WordProcessorInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphExtractor
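Since the commands above differ only in the extractor name, they can be generated rather than typed out. The following is a rough sketch, not part of the PR: the `/path/to/...` locations are placeholders you would replace with your own, `build_command` is a hypothetical helper, and the extractor names are taken from the class list in this PR description (so the graph extractor is `WebGraphExtractor`). It only prints the commands; it does not run Spark.

```python
# Extractor names as listed in this pull request's description.
EXTRACTORS = [
    "AudioInformationExtractor",
    "ImageInformationExtractor",
    "PDFInformationExtractor",
    "PresentationProgramInformationExtractor",
    "SpreadsheetInformationExtractor",
    "TextFilesInformationExtractor",
    "VideoInformationExtractor",
    "WordProcessorInformationExtractor",
    "WebGraphExtractor",
]


def build_command(extractor, jar, input_glob, output_root):
    """Return the spark-submit argument list for one extractor (hypothetical helper)."""
    return [
        "bin/spark-submit",
        "--class", "io.archivesunleashed.app.CommandLineAppRunner",
        jar,
        "--extractor", extractor,
        "--input", input_glob,
        # One output directory per extractor, mirroring the commands above.
        "--output", f"{output_root}/{extractor}",
    ]


# Placeholder paths (assumptions); substitute your own locations.
JAR = "/path/to/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar"
INPUT_GLOB = "/path/to/sample-data/geocities/*"
OUTPUT_ROOT = "/path/to/sample-data/447-test"

commands = [build_command(e, JAR, INPUT_GLOB, OUTPUT_ROOT) for e in EXTRACTORS]

if __name__ == "__main__":
    # The glob is left unquoted so the shell expands it when the line is pasted.
    for cmd in commands:
        print(" ".join(cmd))
```

Printing instead of executing keeps the sketch safe to run anywhere; the output lines can be pasted into a shell from the Spark directory.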
Additional Notes:
WebGraphExtractor is included as an additional option, since its output differs slightly from the csv output of DomainGraphExtractor.
WebPagesExtractor produces output similar to, and more enhanced than, that of PlainTextExtractor. We might want to consider removing PlainTextExtractor in the future.
csv output.