Add save to wayback function(s). (Issue #2) #3

Merged
merged 11 commits into from Jan 26, 2019

2 participants
@greebie
Collaborator

greebie commented Jan 11, 2019

This PR adds save-to-Wayback functionality.

It can save a single URL:

r <- save_wayback("qdr.syr.edu")
r
$url
 [1] "https://qdr.syr.edu/"

$archived_snapshots
$archived_snapshots$closest
$archived_snapshots$closest$status
[1] "200"

$archived_snapshots$closest$available
[1] TRUE

$archived_snapshots$closest$url
[1] "http://web.archive.org/web/20190111181645/https://qdr.syr.edu/"

$archived_snapshots$closest$timestamp
[1] "20190111181645"
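
For downstream use, the nested list printed above can be flattened into a single row. The following is a minimal sketch, assuming the result always has the shape shown above; `snapshot_to_row` is a hypothetical helper for illustration and is not part of this PR.

```r
# Hypothetical helper: flatten a save_wayback()-style result into one row.
# Assumes the list layout printed above; missing fields become NA.
snapshot_to_row <- function(res) {
  closest <- res$archived_snapshots$closest
  field <- function(name) {
    value <- closest[[name]]
    if (is.null(value)) NA else value
  }
  data.frame(
    url = res$url,
    status = as.character(field("status")),
    available = isTRUE(field("available")),
    wayback_url = as.character(field("url")),
    timestamp = as.character(field("timestamp")),
    stringsAsFactors = FALSE
  )
}

# Example input copied from the output above (no network access needed).
example <- list(
  url = "https://qdr.syr.edu/",
  archived_snapshots = list(closest = list(
    status = "200",
    available = TRUE,
    url = "http://web.archive.org/web/20190111181645/https://qdr.syr.edu/",
    timestamp = "20190111181645"
  ))
)
row <- snapshot_to_row(example)
```

A result with no `closest` entry would yield `NA` for the snapshot fields and `FALSE` for `available`, which keeps the row shape stable.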

Or it can save a list of URLs:

r <- archiv(c("qdr.syr.edu", "notaurl", "www.google.com"))

                    url status available?
1  https://qdr.syr.edu/    200       TRUE
2               notaurl    000      FALSE
3 http://www.google.com    200       TRUE
                                                      wayback_url
1  http://web.archive.org/web/20190111190523/https://qdr.syr.edu/
2                                                   url not found
3 http://web.archive.org/web/20190111190630/http://www.google.com
       timestamp
1 20190111190523
2        unknown
3 20190111190630
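
The batch behaviour shown above (one row per input, with placeholder values for URLs that fail) can be sketched as follows. This is an illustration only, not the code in this PR: `save_many` and `fake_saver` are hypothetical names, and the saver is stubbed so the sketch runs without network access.

```r
# Hypothetical sketch: apply a saver function to each URL and collect one row
# per input, substituting placeholder values when a save fails.
save_many <- function(urls, saver) {
  rows <- lapply(urls, function(u) {
    tryCatch(
      saver(u),
      error = function(e) {
        data.frame(url = u, status = "000", available = FALSE,
                   wayback_url = "url not found", timestamp = "unknown",
                   stringsAsFactors = FALSE)
      }
    )
  })
  do.call(rbind, rows)
}

# Stub standing in for a real Wayback request (an assumption for this demo):
# anything without a dot is treated as an invalid URL.
fake_saver <- function(u) {
  if (!grepl(".", u, fixed = TRUE)) stop("not a url: ", u)
  data.frame(url = u, status = "200", available = TRUE,
             wayback_url = paste0("http://web.archive.org/web/<timestamp>/", u),
             timestamp = "<timestamp>", stringsAsFactors = FALSE)
}

results <- save_many(c("qdr.syr.edu", "notaurl", "www.google.com"), fake_saver)
```

Catching the error per URL, rather than around the whole loop, is what lets one bad entry produce a placeholder row without aborting the rest of the batch.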

@greebie greebie requested a review from adam3smith Jan 11, 2019

@adam3smith
Contributor

adam3smith left a comment

Saving to the Wayback Machine (WBM) works flawlessly. I also tested archiving a page with NOARCHIVE set in robots, and the script returned the appropriate 403 message.

@adam3smith

Contributor

adam3smith commented Jan 15, 2019

Still testing saving URLs from files.

@greebie

Collaborator

greebie commented Jan 15, 2019

Fingers crossed. I am starting on issue #5 now. Getting there!

@greebie

Collaborator

greebie commented Jan 17, 2019

I included the relaxed regex, but it seems that I get false negatives in docx files (I haven't tried other formats). For example, from my syllabus I got "http://gutenberg.ca/ebooks/innis-minerva/innis-minerva-00-h.html.Turner", which suggests that somewhere along the line the script is not detecting \n and similar separators.
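
One plausible cause (not a confirmed diagnosis) is that paragraph breaks are lost when the document text is joined, so a whitespace-delimited URL pattern runs on into the next word. The sketch below reproduces that effect in base R; the pattern, the `extract_urls` helper, and the sample paragraph text are assumptions made for illustration and are not taken from archivr.R.

```r
# Hypothetical sketch of the failure mode: a permissive URL pattern keeps
# matching until it hits whitespace, so a lost line break glues the next
# word onto the URL.
url_pattern <- "https?://[^[:space:]]+"

extract_urls <- function(text) {
  urls <- unlist(regmatches(text, gregexpr(url_pattern, text)))
  # Strip sentence punctuation left at the end of a match.
  sub("[.,;:!?)]+$", "", urls)
}

# Made-up paragraphs: the first ends a sentence with a URL.
paragraphs <- c(
  "See http://gutenberg.ca/ebooks/innis-minerva/innis-minerva-00-h.html.",
  "Turner week 3 reading"
)

# Joining paragraphs with no separator reproduces the glued URL.
glued <- extract_urls(paste(paragraphs, collapse = ""))

# Joining with a newline keeps the URL intact.
clean <- extract_urls(paste(paragraphs, collapse = "\n"))
```

If this is what is happening, making sure paragraphs are joined with a newline (or extracting URLs paragraph by paragraph) before matching would avoid the run-on.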

Issues #7, #8 and #11 are resolved in the last commit.

@@ -220,7 +308,7 @@ set_api_key <- function (key) {
 #'
 #' @param url The url to extract urls.
 #' @return a vector of urls.
-get_urls_from_webpage <- function (url) {
+extract_urls_from_webpage <- function (url) {

@adam3smith

adam3smith Jan 17, 2019

Contributor

This function doesn't extract internal/relative links (because of the startsWith call). I think that's generally the right choice. We should either
a) document it (I think this is probably the better option), or
b) add an option for this, which it would inherit from archiv.fromURL.

What do you think?
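
The behaviour described above can be illustrated with a short base R sketch. It assumes the filter keeps only links beginning with "http"; the exact prefix used in archivr.R is not shown in this thread, and `keep_absolute` and the sample links are hypothetical.

```r
# Hypothetical sketch: keeping only hrefs that start with "http" drops
# relative and fragment links.
hrefs <- c(
  "https://qdr.syr.edu/",
  "/about",
  "#top",
  "http://www.google.com",
  "contact.html"
)

keep_absolute <- function(links) {
  # startsWith() is vectorised and returns one logical per link.
  links[startsWith(links, "http")]
}

absolute_only <- keep_absolute(hrefs)
```

Here only the two absolute URLs survive, which is why internal links on a page would not be picked up unless they are first resolved against the page's base URL.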

@greebie

greebie Jan 17, 2019

Collaborator

I think documenting would be better for now. If all things work properly, I think fixing the README will be part of the PR that turns this into a package. (Packaging will require some serious refactoring of the code).

@adam3smith

Contributor

adam3smith commented Jan 26, 2019

@greebie -- I think we're good to merge this PR and turn this into a package. I'll let you handle the merge. Thanks!

@greebie greebie merged commit d43e594 into master Jan 26, 2019

@greebie greebie deleted the issue-2 branch Jan 26, 2019
