Code formatting and code consistency review. #42
Merged
+180
−188
current/text-analysis.md
Outdated
      .keepValidPages()
      .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(r.getContentString)))
      .saveAsTextFile("plain-text-rdd/")
      ```

    - If you wanted to use it on your own collection, you would change "src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
    + If you wanted to use it on your own collection, you would change "src/test/resources/arc//path/to/warcs" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
One super minor catch; then it's ready to merge.
    - ```
    - spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.18.1-SNAPSHOT"
    + ```shell
    + $ spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.18.1-SNAPSHOT"
ianmilligan1
Feb 5, 2020
Member
We'll want to make sure to keep updating the `--packages` call in each version of the docs?
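
A minimal sketch of how that check could be automated, offered as an illustration rather than anything in this PR. It assumes the docs live in a directory such as `current/` and that every coordinate starts with `io.archivesunleashed:aut:`; the function name `list_aut_versions` is hypothetical.

```shell
# Hypothetical helper, not part of the repository.
# Prints each distinct aut package coordinate found under the given
# directory, prefixed by how many times it occurs, so a stale version
# left behind after a release stands out.
list_aut_versions() {
  # -r: recurse, -h: omit file names, -o: print only the matched text,
  # -E: extended regular expressions
  grep -rhoE 'io\.archivesunleashed:aut:[0-9A-Za-z.-]+' "$1" | sort | uniq -c
}

# Example call (assumes a docs directory named current/ exists):
#   list_aut_versions current/
```

If more than one coordinate shows up for a single docs version, at least one `--packages` call was probably missed.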
current/text-analysis.md
Outdated
      .keepValidPages()
      .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(r.getContentString)))
      .saveAsTextFile("plain-text-rdd/")
      ```

    - If you wanted to use it on your own collection, you would change "src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
    + If you wanted to use it on your own collection, you would change "src/test/resources/arc//path/to/warcs" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
@SamFritz @ianmilligan1 updated. Just removed that line, since it's not really necessary.
ruebot commented Feb 4, 2020
This probably does it. I did a pretty heavy pass across each file. Though, I could have missed things. Let me know if I did.