Document command line app #14

Closed
ruebot opened this issue Oct 26, 2019 · 2 comments
ruebot commented Oct 26, 2019

We have little to no documentation (other than doc comments) for the command line app: https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/app

ianmilligan1 commented Oct 26, 2019

FYI from AUT PR #236:

DataFrame implementation (--df flag)

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz  --output output1 --df 

Partition (combining all files together) (--partition flag)

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz  --output output2 --df  --partition 1

Output will be a single file rather than PART-0000, PART-0001, etc.

Write each W/ARC to its own directory (--split flag)

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz  --output output3 --df  --split
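
Since the three invocations above differ only in the extractor, the output directory, and the trailing flags, they could be wrapped in a small helper. This is just a rough sketch, not something that ships with aut: `aut_cmd`, `SPARK_SUBMIT`, `AUT_JAR`, and `RESOURCES` are names made up for illustration, and the paths are copied from the commands above, so adjust them to your own checkout. It only prints the command lines; it does not run Spark.

```shell
#!/bin/sh
# Sketch only: these paths are copied from the examples in this comment
# and are assumptions about the local layout.
SPARK_SUBMIT="./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit"
AUT_JAR="./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar"
RESOURCES="./aut_self/aut/src/test/resources"

# aut_cmd is a hypothetical helper (not part of aut).
# Usage: aut_cmd EXTRACTOR OUTPUT_DIR [extra flags...]
# It prints the spark-submit command line; extra flags such as
# --df, --partition 1, or --split are appended unchanged.
aut_cmd() {
  extractor="$1"
  output="$2"
  shift 2
  echo "$SPARK_SUBMIT" \
    --class io.archivesunleashed.app.CommandLineAppRunner \
    "$AUT_JAR" \
    --extractor "$extractor" \
    --input "$RESOURCES/warc/example.warc.gz" "$RESOURCES/arc/example.arc.gz" \
    --output "$output" "$@"
}

# Print (do not run) the three invocations shown above.
aut_cmd DomainGraphExtractor output1 --df
aut_cmd DomainFrequencyExtractor output2 --df --partition 1
aut_cmd DomainFrequencyExtractor output3 --df --split
```

To actually run a command rather than print it, drop the `echo` inside `aut_cmd`.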

I can't completely remember the context for why this was done, as opposed to just loading in scripts?

ruebot commented Oct 28, 2019

It was this one: archivesunleashed/aut#195. Makes it a lot easier to use spark-submit.

I wasn't paying much attention at the time since I was heads down on auk, so I don't recall ever really putting anything through its paces with spark-submit.

ruebot added a commit that referenced this issue Apr 7, 2020
- Resolves #14
- Documents archivesunleashed/aut#431
ianmilligan1 pushed a commit that referenced this issue Apr 7, 2020
- Resolves #14
- Documents archivesunleashed/aut#431