
Code formatting and code consistency review. #42

Merged: 2 commits merged into master from issue-29 on Feb 5, 2020

Conversation

ruebot (Member) commented Feb 4, 2020

This probably does it. I did a pretty heavy pass across each file, though I could have missed things. Let me know if I did.

- Resolves #29
ruebot requested review from lintool, ianmilligan1, and SamFritz on Feb 4, 2020
SamFritz (Member) left a comment

Looks great @ruebot! I found one little change; everything else looks consistent to me :)

```
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(r.getContentString)))
.saveAsTextFile("plain-text-rdd/")
```

Before: If you wanted to use it on your own collection, you would change "src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.

After: If you wanted to use it on your own collection, you would change "src/test/resources/arc//path/to/warcs" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
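For context, the hunk above is the tail end of the docs' plain-text extraction script. A minimal sketch of the full RDD version, assuming the standard aut imports and the SparkContext `sc` that spark-shell provides, would look roughly like this:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load ARC/WARC records, keep valid crawled pages, strip the HTML,
// and save (crawl date, domain, URL, plain text) tuples as text files.
RecordLoader.loadArchives("/path/to/warcs", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(r.getContentString)))
  .saveAsTextFile("plain-text-rdd/")
```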

SamFritz (Member) commented Feb 5, 2020

There is an extra / before path/to/warcs.

ianmilligan1 (Member) left a comment

One super minor catch; then it's ready to merge.

Before:

```shell
spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.18.1-SNAPSHOT"
```

After:

```shell
$ spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.18.1-SNAPSHOT"
```

ianmilligan1 (Member) commented Feb 5, 2020

We'll want to make sure to keep updating the --packages call in each version of the docs?

ruebot (Member, Author) commented Feb 5, 2020

Yep. Shouldn't be that big of a deal.

On the same hunk:

After: If you wanted to use it on your own collection, you would change "src/test/resources/arc//path/to/warcs" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.

ianmilligan1 (Member) commented Feb 5, 2020

"src/test/resources/arc//path/to/warcs" should be "/path/to/warcs"

ruebot (Member, Author) commented Feb 5, 2020

@SamFritz @ianmilligan1 updated. Just removed that line, since it's not really necessary.

ianmilligan1 merged commit 245aab4 into master on Feb 5, 2020
ianmilligan1 deleted the issue-29 branch on Feb 5, 2020