Remove RDD option in app; DataFrame only now. #450

Merged
merged 3 commits into from Apr 20, 2020

Conversation

@ruebot
Member

ruebot commented Apr 20, 2020

GitHub issue(s): #449

What does this Pull Request do?

Remove RDD option in app; DataFrame only now.

  • Resolves #449
  • Updates and renames tests where applicable

I'll get an associated documentation PR with this as well.

How should this be tested?

  • TravisCI

If you want to be more thorough, run the following:

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainGraphExtractorGRAPHML --output-format GRAPHML;
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainGraphExtractorGEXF --output-format GEXF
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainGraphExtractorTEXT
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainFrequencyExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/ImageGraphExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/PlainTextExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebPagesExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/WebPagesExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainFrequencyExtractorSingle --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/ImageGraphExtractorSingle --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/PlainTextExtractorSingle --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebPagesExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/WebPagesExtractorSingle --partition 1
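
Since the commands above differ only in extractor name, output suffix, and a couple of trailing flags, they could be generated rather than pasted. The following is a rough dry-run sketch, not part of this PR: the names `AUT_JAR`, `INPUT`, `OUT`, `build_cmd`, and `all_cmds` are made up for illustration, and the paths are copied from the commands above. It only prints the commands; it does not run Spark.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the spark-submit commands instead of running them.
# All variable and function names here are illustrative, not part of aut.
AUT_JAR="/home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar"
INPUT="/home/nruest/Projects/au/sample-data/geocities/*"
OUT="/home/nruest/Projects/au/sample-data/449-test"

# build_cmd EXTRACTOR OUTPUT_SUFFIX [EXTRA ARGS...]
# Prints one command line; the glob in INPUT stays unexpanded because it is quoted.
build_cmd() {
  local extractor="$1" suffix="$2"
  shift 2
  echo "bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner" \
    "$AUT_JAR --extractor $extractor --input $INPUT" \
    "--output $OUT/$extractor$suffix" "$@"
}

# all_cmds: prints the same 11 invocations listed above (in a slightly different order).
all_cmds() {
  local e
  build_cmd DomainGraphExtractor GRAPHML --output-format GRAPHML
  build_cmd DomainGraphExtractor GEXF --output-format GEXF
  build_cmd DomainGraphExtractor TEXT
  for e in DomainFrequencyExtractor ImageGraphExtractor PlainTextExtractor WebPagesExtractor; do
    build_cmd "$e" ""                     # default partitioning
    build_cmd "$e" Single --partition 1   # single output file
  done
}

all_cmds
```

To actually run them, the printed lines can be reviewed and then piped to a shell from the Spark directory, at which point the `*` in the input path is expanded as in the original commands.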

Should produce the following:

[nruest@bomba:449-test]$ tree .
.
├── DomainFrequencyExtractor
│   ├── part-00000-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00001-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00002-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00003-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00004-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00005-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00006-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00007-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00008-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00009-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00010-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00011-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00012-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00013-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00014-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00015-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00016-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00017-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00018-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00019-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00020-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   └── _SUCCESS
├── DomainFrequencyExtractorSingle
│   ├── part-00000-804dacc4-932c-44ea-b10e-66430f8f3a45-c000.csv
│   └── _SUCCESS
├── DomainGraphExtractorGEXF
│   └── GEXF.gexf
├── DomainGraphExtractorGRAPHML
│   └── GRAPHML.graphml
├── DomainGraphExtractorTEXT
│   ├── part-00000-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00001-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00002-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00003-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00004-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00005-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00006-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00007-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00008-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00009-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00010-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00011-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00012-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00013-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00014-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00015-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00016-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00017-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00018-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00019-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00020-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00021-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00022-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00023-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00024-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00025-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00026-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00027-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00028-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00029-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00030-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00031-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00032-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00033-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00034-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00035-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00036-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00037-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00038-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00039-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00040-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00041-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00042-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00043-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00044-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00045-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00046-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00047-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00048-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00049-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00050-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00051-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00052-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00053-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00054-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00055-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00056-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00057-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00058-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00059-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00060-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00061-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00062-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00063-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00064-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   └── _SUCCESS
├── ImageGraphExtractor
│   ├── part-00000-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00001-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00002-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00003-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00004-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00005-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00006-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00007-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00008-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00009-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   └── _SUCCESS
├── ImageGraphExtractorSingle
│   ├── part-00000-005dd922-dc35-46ca-b9d3-c3184637e1db-c000.csv
│   └── _SUCCESS
├── PlainTextExtractor
│   ├── part-00000-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00001-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00002-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00003-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00004-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00005-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00006-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00007-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00008-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00009-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   └── _SUCCESS
├── PlainTextExtractorSingle
│   ├── part-00000-e982ea04-0176-4070-a739-6532aef2edba-c000.csv
│   └── _SUCCESS
├── WebPagesExtractor
│   ├── part-00000-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00001-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00002-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00003-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00004-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00005-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00006-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00007-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00008-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00009-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   └── _SUCCESS
└── WebPagesExtractorSingle
    ├── part-00000-db87b0bc-5761-4f8c-bd79-92dbaf41d0fd-c000.csv
    └── _SUCCESS

11 directories, 131 files
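
A quick way to confirm the `--partition 1` runs behaved as in the listing above is to check that each `*Single` directory holds a `_SUCCESS` marker and exactly one `part-*.csv` file. This is a minimal sketch under that assumption; `count_parts`, `check_single`, and `check_all_single` are hypothetical helper names, not anything shipped with aut.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the layout of single-partition output directories.
# Helper names are illustrative only.

# count_parts DIR: print the number of part-*.csv files directly inside DIR.
count_parts() {
  find "$1" -maxdepth 1 -type f -name 'part-*.csv' | wc -l | tr -d ' '
}

# check_single DIR: succeed only if DIR has _SUCCESS and exactly one part file.
check_single() {
  [ -f "$1/_SUCCESS" ] && [ "$(count_parts "$1")" -eq 1 ]
}

# check_all_single BASE: report every BASE/*Single directory that fails the check.
# Returns non-zero if any directory has an unexpected layout.
check_all_single() {
  local dir status=0
  for dir in "$1"/*Single; do
    [ -d "$dir" ] || continue
    if ! check_single "$dir"; then
      echo "unexpected layout: $dir"
      status=1
    fi
  done
  return "$status"
}

# Example (path taken from the commands above):
# check_all_single /home/nruest/Projects/au/sample-data/449-test
```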
@ruebot ruebot requested review from lintool and ianmilligan1 Apr 20, 2020
codecov bot commented Apr 20, 2020

Codecov Report

Merging #450 into master will decrease coverage by 1.00%.
The diff coverage is 88.88%.

@@            Coverage Diff             @@
##           master     #450      +/-   ##
==========================================
- Coverage   75.55%   74.55%   -1.01%     
==========================================
  Files          40       40              
  Lines        1395     1285     -110     
  Branches      265      246      -19     
==========================================
- Hits         1054      958      -96     
+ Misses        218      211       -7     
+ Partials      123      116       -7     
ruebot added 2 commits Apr 20, 2020
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Apr 20, 2020
Member Author

ruebot commented Apr 20, 2020

Documentation PR: archivesunleashed/aut-docs#56

Member

ianmilligan1 left a comment

Tested with both TravisCI and the provided code snippets - they all work very well. Great stuff!

@ianmilligan1 ianmilligan1 merged commit 17ac324 into master Apr 20, 2020
2 of 3 checks passed
codecov/project 74.55% (+-1.01%) compared to d5a0433
codecov/patch 88.88% of diff hit (target 75.55%)
continuous-integration/travis-ci/pr The Travis CI build passed
@ianmilligan1 ianmilligan1 deleted the issue-449 branch Apr 20, 2020
ianmilligan1 pushed a commit to archivesunleashed/aut-docs that referenced this pull request Apr 20, 2020
#56)

* Documentation updates for archivesunleashed/aut#450