Update Documentation for 0.18.0 #77
ianmilligan1
requested review from
greebie and
SamFritz
Nov 28, 2018
ruebot
reviewed
Nov 28, 2018
We should probably tweak it a bit here to provide two different examples.
content/aut/index.md
Outdated
### Location of the Resource in ARCs and WARCs

Finally, you may want to know what WARC file the different resources are located in! The following command will list the WARC file that each URL is found in.
ruebot
Nov 28, 2018
Member
The following command will provide the full path and filename of the ARC/WARC that each url is found in.
  .map(r => (r.getUrl, r.getArchiveFilename))
  .take(10)
ruebot
Nov 28, 2018
Member
Or, if you just want to know the filename, not the full path and filename, ....
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import org.apache.commons.io.FilenameUtils

RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getUrl, FilenameUtils.getName(r.getArchiveFilename)))
  .saveAsTextFile("/path/to/output")
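To make the difference between the two variants concrete, here is a minimal standalone sketch in plain Scala. It does not call aut or Spark. The `baseName` helper and the sample pairs are hypothetical; they only approximate what `FilenameUtils.getName` does to a `/`-separated path such as the value returned by `r.getArchiveFilename`.

```scala
object ArchiveNameSketch {
  // Hypothetical helper: keep only the text after the last '/'.
  // Approximates FilenameUtils.getName for '/'-separated paths.
  def baseName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  def main(args: Array[String]): Unit = {
    // Made-up (url, archive path) pairs, not real aut output.
    val records = Seq(
      ("http://www.example.com/", "/data/warcs/example.arc.gz"),
      ("http://www.example.com/about", "/data/warcs/example.arc.gz")
    )
    // Same shape as the .map(...) step above: keep the URL, shorten the path.
    records
      .map { case (url, path) => (url, baseName(path)) }
      .foreach(println)
  }
}
```

A path with no `/` is returned unchanged, so the helper is safe to apply whether or not the full path is present.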
Additions are great and look good locally. Nice work!!
SamFritz
approved these changes
Nov 29, 2018
greebie
reviewed
Dec 13, 2018
content/aut/index.md
Outdated
import io.archivesunleashed.matchbox._
val r = RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
greebie
Dec 13, 2018
Collaborator
I think the results will mostly be 200 if you include .keepValidPages(), so it might be fine to leave that out here.
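One way to see why the filter matters for this example is to tally the status codes in the output. The sketch below is plain Scala with no aut or Spark. The `Record` case class, the `tally` helper and the sample data are hypothetical stand-ins; in aut the status would come from `r.getHttpStatus`, and nothing here assumes what `.keepValidPages()` actually filters on.

```scala
object StatusTallySketch {
  // Hypothetical stand-in for an archive record.
  final case class Record(url: String, httpStatus: String)

  // Count how many records carry each status code,
  // most frequent first, ties broken by status code.
  def tally(records: Seq[Record]): Seq[(String, Int)] =
    records
      .groupBy(_.httpStatus)
      .map { case (status, rs) => (status, rs.size) }
      .toSeq
      .sortBy { case (status, n) => (-n, status) }

  def main(args: Array[String]): Unit = {
    // Made-up sample, not taken from example.arc.gz.
    val sample = Seq(
      Record("http://www.example.com/", "200"),
      Record("http://www.example.com/old", "404"),
      Record("http://www.example.com/about", "200"),
      Record("http://www.example.com/moved", "301")
    )
    tally(sample).foreach(println)
  }
}
```

If nearly every record in the filtered output reports 200, the status column adds little to the example, which is the case for dropping the filter in this snippet.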
greebie
approved these changes
Dec 13, 2018
This was referenced Aug 9, 2019
Documentation now lives on a GitHub wiki here.
ianmilligan1
closed this
Sep 11, 2019
ruebot
deleted the
0.18.0-updates
branch
Sep 11, 2019
ianmilligan1 commented Nov 28, 2018 (edited)

Do not merge until 0.18.0 is released

This PR adds documentation for:

- getHttpStatus (#74)
- getArchiveFilename (#74)
- WriteGraph (including GraphML option) (#71)

I've tested them all using the example.arc.gz file, and we have also tested each PR.